Indexing Schemes for Similarity Search In Datasets of Short Protein Fragments

Authors

  • Aleksandar Stojmirovic
  • Vladimir Pestov
Abstract

We propose a family of very efficient hierarchical indexing schemes for ungapped, score matrix-based similarity search in large datasets of short (4-12 amino acid) protein fragments. This type of similarity search is important both as a building block for more complex algorithms and for possible direct use in biological investigations; such datasets can contain on the order of 60 million objects. Our scheme is based on the internal geometry of the amino acid alphabet and performs exceptionally well: for example, it outputs the 100 nearest neighbours of any possible fragment of length 10 after scanning, on average, less than one per cent of the entire dataset.
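To make the search problem concrete, the following is a minimal sketch of ungapped, score matrix-based similarity and of the exhaustive scan that an indexing scheme aims to avoid. It is not the authors' indexing scheme. The four-letter alphabet, the `toy_score` function, and the names `ungapped_score` and `nearest_neighbours` are assumptions made for illustration; a real application would use the 20-letter amino acid alphabet and an established score matrix.

```python
import heapq

# Hypothetical toy alphabet (an assumption); real fragments use 20 amino acids.
ALPHABET = "ACDE"


def toy_score(a, b):
    """Toy substitution score (an assumption): reward matches, penalise mismatches."""
    return 4 if a == b else -1


def ungapped_score(x, y, score=toy_score):
    """Sum per-position scores of two equal-length fragments (no gaps allowed)."""
    if len(x) != len(y):
        raise ValueError("ungapped comparison needs fragments of equal length")
    return sum(score(a, b) for a, b in zip(x, y))


def nearest_neighbours(query, dataset, k, score=toy_score):
    """Brute-force baseline: scan every fragment and keep the k highest-scoring."""
    return heapq.nlargest(k, dataset, key=lambda f: ungapped_score(query, f, score))


fragments = ["ACD", "ACE", "EEE", "AAA"]
print(nearest_neighbours("ACD", fragments, k=2))
```

This baseline scores the query against every fragment in the dataset, whereas the abstract reports scanning on average less than one per cent of the dataset for a comparable query.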

Similar resources

Indexing Schemes for Similarity Search: an Illustrated Paradigm

We suggest a variation of the Hellerstein-Koutsoupias-Papadimitriou indexability model for datasets equipped with a similarity measure, with the aim of better understanding the structure of indexing schemes for similarity-based search and the geometry of similarity workloads. This in particular provides a unified approach to a great variety of schemes used to index into metric spaces and facil...


Quasi-metrics, Similarities and Searches: aspects of geometry of protein datasets

A quasi-metric is a distance function that satisfies the triangle inequality but is not necessarily symmetric: it can be thought of as an asymmetric metric. Quasi-metrics were first introduced in the 1930s and are a subject of intensive research in the context of topology and theoretical computer science. The central result of this thesis, developed in Chapter 3, is that a natural correspondence exists betwee...


A Method Based on Divisive Hierarchical Clustering for Indexing Image Data

It is conventional to use multi-dimensional indexing structures to accelerate search operations in content-based image retrieval systems. Much effort has gone into developing multi-dimensional indexing structures so far. In most practical applications of image retrieval, high-dimensional feature vectors are required, but current multi-dimensional indexing structures lose their effici...


Curse of Dimensionality in the Application of Pivot-based Indexes to the Similarity Search Problem

In this work we study the validity of the so-called curse of dimensionality for indexing of databases for similarity search. We perform an asymptotic analysis, with a test model based on a sequence of metric spaces $(\Omega_d)$ from which we pick datasets $X_d$ in an i.i.d. fashion. We call the subscript $d$ the dimension of the space $\Omega_d$ (e.g. for $\mathbb{R}^d$ the dimension is just the usual one) and we allow the si...


Fuzzy retrieval of encrypted data by multi-purpose data-structures

The growing amount of information arising from emerging technologies has left organizations facing challenges in maintaining and managing their information. Expanding hardware and human resources, and outsourcing data management and maintenance to an external organization in the form of cloud storage services, are two common approaches to overcome these challenges. The first approach costs of...


Journal:
  • Inf. Syst.

Volume 32

Publication date: 2007